
Is PyTorch 2.0 Faster Than PyTorch 1.13?

In this article, we compare the current stable release of PyTorch (1.13) with the newly announced PyTorch 2.0 to see which performs better.
PyTorch 2.0 was announced in early December 2022, with the main highlight being torch.compile, which provides significant speed improvements from just one added line of code. As you can see below, it works rather well, delivering an almost 30% reduction in training time compared to the current stable version of PyTorch (1.13).

[Charts: training and inference times in seconds for the runs 2.0_fp16_reduce-overhead, 2.0_fp16, 2.0_fp32, 1.13_fp16, and 1.13_fp32]


Want to try it out? This article will walk you through installing PyTorch 2.0 and how to use it, and then end with a comparison of PyTorch 2.0 and 1.13 to dig into these improvements.
Let's dive in!

Background to PyTorch 2.0

Since its inception, PyTorch has been a key building block for many of the breakthroughs in artificial intelligence. Over the last few years, the PyTorch team and contributors have iterated PyTorch from 1.0 to the most recent stable version, 1.13. PyTorch has also moved to the Linux Foundation, where it is managed by the newly chartered PyTorch Foundation, with support from several large companies, including Meta, AWS, NVIDIA, AMD, Google, and Microsoft.
On 2 December 2022, PyTorch announced PyTorch 2.0, the first step towards the next-generation 2-series releases. Eager mode, which users love for its simplicity and flexibility, is still available in PyTorch 2.0, but there is now also a compile mode that can significantly speed up execution.
Although several announcements were made throughout the conference, torch.compile was the major attraction. It's a fully optional feature (users are free to leverage it or not), which keeps PyTorch 2.0 100% backward compatible. You therefore don't need to change any of your code to make it compatible with the 2.x versions, but you can get extra performance by wrapping your model in torch.compile.
In this article, we will cover:
✅ Installing PyTorch 2.0
✅ Using PyTorch 2.0
✅ Comparing PyTorch 1.13 and PyTorch 2.0
✅ Things that don't work in PyTorch 2.0 for now (but will be supported in the near future)

Note: The primary goal of this article is to present benchmark results for the current stable version of PyTorch (i.e. 1.13) and the nightly 2.0 version.
If you want to learn more about the specifics of how things work at a technical level, I strongly advise watching the PyTorch conference and reading the blog post that accompanies it; you'll gain a deeper understanding of the components of PyTorch 2.0.
Now that you know how PyTorch 2.0 works (if you've gone through the resources above, that is!), it's time to take it for a spin. Let's dig in:

Install PyTorch 2.0

The first step is to install PyTorch 2.0. It is available in the nightly version, and there are plans to release it as a stable version in March 2023.
Note: You might need to select the nightly wheel based on your CUDA version. For the machine this code was run on, the CUDA version was 11.8.
You can install PyTorch 2.0 either via pip or conda like so:
pip
pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu118
pip install transformers datasets evaluate wandb omegaconf sentencepiece
conda
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia
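
Once installed, using torch.compile is a one-line change. Here's a minimal sketch with a toy model (the benchmarks in this article use a HuggingFace bert-large-uncased classifier via the linked training script instead):

import torch
import torch.nn as nn

# A toy placeholder model standing in for the real benchmark model.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).cuda()

# The one-line change: wrap the model with torch.compile.
# The default backend is TorchInductor; mode can be "default",
# "reduce-overhead", or "max-autotune".
compiled_model = torch.compile(model)

x = torch.randn(16, 128, device="cuda")
out = compiled_model(x)  # the first call triggers compilation, so it is much slower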

The Dataset We'll be Experimenting with

For this benchmark, we will be using the IMDB dataset from the HuggingFace datasets library. Here are a few samples of the dataset visualized with W&B Tables.
Note: the html_text column shows the text column rendered as HTML using wandb.Html, since the original text essentially contains HTML markup.


[W&B table: some samples from the IMDB dataset]

[Chart: distribution of sequence lengths for the train and evaluation sets]
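
If you want to build a similar table yourself, here's a minimal sketch of loading the dataset and logging a few samples to a W&B Table (the project name and the 100-sample subset are illustrative choices, not the exact logging code from the benchmark script):

import wandb
from datasets import load_dataset

imdb = load_dataset("imdb")  # 25,000 train and 25,000 test samples

run = wandb.init(project="pytorch-2-benchmark", job_type="eda")  # hypothetical project name
table = wandb.Table(columns=["text", "html_text", "label"])

for sample in imdb["train"].select(range(100)):  # log a small subset only
    # The raw reviews contain HTML tags, so wandb.Html renders them nicely.
    table.add_data(sample["text"], wandb.Html(sample["text"]), sample["label"])

run.log({"imdb_samples": table})
run.finish()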


Methodology

The benchmark compares PyTorch 1.13 with the PyTorch nightly version (namely 2.0.0.dev20230122+cu118). You'll find the salient details below. Also, please note: pure PyTorch was used here, meaning we used no training wrappers for our experiments.

System specs

An A100 GPU instance with 40GB of VRAM rented from jarvislabs.ai

Dataset preprocessing

  • The dataset was tokenized with a fixed max length of 512 across all experiments (see the preprocessing sketch after this list).
  • Dynamic-shaped input (aka dynamic padding) couldn't be used, for the reasons covered in the failed experiments section.
  • Total training samples: 25,000
  • Total validation/test samples: 25,000
  • num_workers was set to 6 for all experiments.
  • Batch size was set to 16 for full-precision (fp32) experiments and 24 for mixed-precision (fp16) experiments.
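
Here's a minimal sketch of this preprocessing, assuming the standard HuggingFace tokenizer and a plain PyTorch DataLoader (the full script lives in the linked repository; the exact details there may differ):

from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")

def tokenize(batch):
    # Fixed-length padding/truncation to 512 tokens; dynamic padding is not
    # used here because dynamic-shape support in torch.compile is still early.
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

imdb = load_dataset("imdb")
tokenized = imdb.map(tokenize, batched=True, remove_columns=["text"])
tokenized.set_format("torch")

train_loader = DataLoader(
    tokenized["train"],
    batch_size=16,   # 16 for the fp32 runs, 24 for the fp16 runs
    shuffle=True,
    num_workers=6,
)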

Model details

  • All the experiments are performed on bert-large-uncased from the HuggingFace hub.
  • Models were trained for 3 epochs in all experiments.
  • A learning rate of 2e-5 or 3e-5, depending on the batch size, was used (a setup sketch follows this list).
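
For reference, here is a sketch of how the model and optimizer might be set up (the 2e-5 learning rate shown is just one of the settings used; this is an illustration, not the exact benchmark script):

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2
).cuda()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # 2e-5 or 3e-5 depending on batch size
num_epochs = 3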

Experiment details

The following experiments are run using a simple training script here. You can view the complete GitHub Repository here.
All of the PyTorch 2.0 experiments are run using the default TorchInductor compiler backend, powered by OpenAI's Triton. However, there are other compiler backends available as well, and you are free to experiment with them as per your needs.
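
Selecting the backend and compile mode explicitly looks like this (a sketch; "inductor" is already the default, and the exact list of available backends depends on what your install reports):

import torch
import torch.nn as nn
import torch._dynamo as dynamo

# List the compiler backends available in this install.
print(dynamo.list_backends())

model = nn.Linear(512, 2).cuda()  # placeholder; the benchmarks use bert-large-uncased

# Explicitly request TorchInductor and one of the modes used in these
# experiments ("default", "reduce-overhead", or "max-autotune").
compiled_model = torch.compile(model, backend="inductor", mode="reduce-overhead")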



Here's the configuration used for each experiment 👇

[W&B table: per-run configuration for the five experiments, with columns run_name, fp16, batch_size, learning_rate, torch_compile, torch_compile_mode, torch_compile_backend, torch_compile_dynamic, max_length, num_epochs, dynamic_padding, model_name_or_path, gradient_checkpointing, and seed]
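
If you'd like a feel for what one of these configurations looks like in code, here is a sketch using OmegaConf (which the training script's dependencies include). The values are illustrative, assembled from the methodology above; per-run specifics such as the seed are assumptions, not copied from the table.

from omegaconf import OmegaConf

# Illustrative run configuration; field names mirror the columns above.
config = OmegaConf.create({
    "run_name": "2.0_fp16_reduce-overhead",
    "fp16": True,
    "batch_size": 24,              # 24 for fp16 runs, 16 for fp32 runs
    "learning_rate": 3e-5,         # 2e-5 or 3e-5 depending on batch size
    "torch_compile": True,
    "torch_compile_mode": "reduce-overhead",
    "torch_compile_backend": "inductor",
    "torch_compile_dynamic": False,
    "max_length": 512,
    "num_epochs": 3,
    "dynamic_padding": False,
    "model_name_or_path": "bert-large-uncased",
    "gradient_checkpointing": False,
    "seed": 42,                    # hypothetical seed, not the one actually used
})
print(OmegaConf.to_yaml(config))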

Results

Here are the metrics we care about, followed by a table showcasing how 2.0 performed (a measurement sketch follows this list). Time is measured in seconds unless specified otherwise.
  • first_iter_time: The time taken to perform the first step of training/inference. Useful for determining how long model compilation takes in compile mode.
  • avg_iter_time: The average time taken per step after the first step (the first step is excluded from the calculation of this average).
  • iter_per_sec: Number of training/inference steps per second.
  • samples_per_sec: Number of data samples processed per second.
  • total_time: Total time taken for training/inference. Note that this time only includes the time taken purely for training/inference and not the other stages like data processing.
  • total_time_hr: Converted form of total_time into hh:mm:ss format.
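
Here's a rough sketch of how these timings can be computed for inference (an illustration of the bookkeeping, not the exact instrumentation in the benchmark script); note the CUDA synchronization before reading the clock:

import time
import torch

def timed_inference(compiled_model, data_loader, batch_size, device="cuda"):
    step_times = []
    for batch in data_loader:
        # Drop the label column; only the inputs are needed to time inference.
        batch = {k: v.to(device) for k, v in batch.items() if k != "label"}
        torch.cuda.synchronize()           # make sure prior GPU work has finished
        start = time.perf_counter()
        with torch.no_grad():
            compiled_model(**batch)
        torch.cuda.synchronize()           # wait for this step to finish
        step_times.append(time.perf_counter() - start)

    first_iter_time = step_times[0]                            # includes compilation time
    avg_iter_time = sum(step_times[1:]) / len(step_times[1:])  # first step excluded
    iter_per_sec = 1.0 / avg_iter_time
    samples_per_sec = iter_per_sec * batch_size
    total_time = sum(step_times)
    return first_iter_time, avg_iter_time, iter_per_sec, samples_per_sec, total_time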


[W&B charts: training benchmark results]

[W&B charts: inference benchmark results]

[W&B charts: system metrics]

Takeaways from the system metrics charts:
  • PyTorch 2.0 used slightly less GPU memory than 1.13 at the same batch size in the full-precision experiments.
  • In the mixed-precision (fp16) experiments, PyTorch 2.0 used 82.64% of GPU memory in default compile mode and 84.03% in reduce-overhead mode, compared to 90.05% for PyTorch 1.13. That's an almost 8% reduction in memory usage just from wrapping the model in torch.compile 🔥!

Detailed breakdown of the results

Let's break down all of the above results one by one.

Full precision (fp32) training comparisons


[W&B table: full-precision (fp32) training runs, with columns run_name, train/first_iter_time, train/avg_iter_time, train/iter_per_sec, train/samples_per_sec, train/total_time, train/total_time_hr, train/loss, and train/accuracy]
Takeaways:
  • The model takes almost a minute to compile in default mode (as measured by the first iteration time).
  • Training time is reduced by 12.48% compared to training without torch.compile in full-precision settings.

Full precision (fp32) inference comparisons


[W&B table: full-precision (fp32) inference runs, with columns run_name, infer/first_iter_time, infer/avg_iter_time, infer/iter_per_sec, infer/samples_per_sec, infer/total_time, infer/total_time_hr, and test/accuracy]
Takeaways:
  • The model takes around 15 seconds to compile in default mode, as measured by the first iteration time (about 41x the first-iteration time without compilation).
  • Inference time is reduced by 10.72% compared to inference without torch.compile.

Mixed precision (fp16) training comparisons


[W&B table: mixed-precision (fp16) training runs, with columns run_name, train/first_iter_time, train/avg_iter_time, train/iter_per_sec, train/samples_per_sec, train/total_time, train/total_time_hr, train/loss, and train/accuracy]
Takeaways:
  • The model takes 73 seconds to compile in default mode and slightly less (54 seconds) in reduce-overhead mode.
  • Training time is reduced by 28.6% in default mode and by 29.91% in reduce-overhead mode compared to training without torch.compile.
Now we're seeing the real numbers in mixed-precision training 😎. This is super cool: we just have to add a single line of code to wrap our model and enjoy an almost 30% reduction in training time.
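
For completeness, here is a sketch of what a mixed-precision training loop with torch.compile can look like, reusing the model, optimizer, num_epochs, and train_loader from the sketches above (standard torch.cuda.amp autocast and GradScaler; an illustration, not the exact benchmark script):

import torch

compiled_model = torch.compile(model, mode="reduce-overhead")
scaler = torch.cuda.amp.GradScaler()

for epoch in range(num_epochs):
    for batch in train_loader:
        batch = {k: v.cuda() for k, v in batch.items()}
        if "label" in batch:
            batch["labels"] = batch.pop("label")   # HuggingFace models expect "labels"
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = compiled_model(**batch).loss    # the model returns the loss when labels are passed
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()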

Mixed precision (fp16) inference comparisons


[W&B table: mixed-precision (fp16) inference runs, with columns run_name, infer/first_iter_time, infer/avg_iter_time, infer/iter_per_sec, infer/samples_per_sec, infer/total_time, infer/total_time_hr, and test/accuracy]

Takeaways:
  • The model takes around 19 seconds to compile in both default and reduce-overhead modes.
  • Inference time is reduced by 26.19% in default mode and by 26.6% in reduce-overhead mode compared to inference without torch.compile.

Failed experiments

Since PyTorch 2.0 is fairly new and not yet a stable release, there were many failed experiments. Here's a list of them (the attempted calls are sketched below the list):
  • Models like RoBERTa, ALBERT, DeBERTaV3, Funnel Transformer, and DistilBERT didn't work with torch.compile; some of them caused segmentation faults or OOM errors.
  • max-autotune mode raised a segmentation fault.
  • Gradient checkpointing didn't work (a similar segmentation fault).
  • Dynamic padding didn't work. It was announced at the conference that support for dynamic shapes is still very early and not ready for consumption (it should be ready by the time the stable version is released).
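
For reference, these are roughly the calls that were attempted, reusing the model from the earlier sketches (a sketch only; each of them ran into the issues listed above on the nightly build used here):

import torch

# max-autotune compile mode (raised a segmentation fault in these experiments):
compiled_model = torch.compile(model, mode="max-autotune")

# Gradient checkpointing on the HuggingFace model (similar segmentation fault):
model.gradient_checkpointing_enable()

# Dynamic shapes, which would enable dynamic padding (support was still very early):
compiled_dynamic = torch.compile(model, dynamic=True)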

Conclusion

PyTorch 2.0 looks really promising.
In this article, we saw how to use PyTorch 2.0. It is a powerful tool that can speed up your training and inference times with the addition of just one extra line of code. Furthermore, PyTorch 2.0 is fully backward compatible. Although some models and modes don't work for now, they should be supported in the near future. The future looks exciting for all users of PyTorch, be they ML engineers, compiler/hardware engineers, or code contributors.